Wine Quality Analysis¶
Created 17-Oct-2024 Mark A. Goforth, Ph.D.¶
Purpose¶
This notebook is designed for EDA and train a DNN model to perform a quality estimation of wine by it's chemical composition.
Goal¶
Challenges & Discussion¶
General Steps for Approach¶
Download data
- wine quality data is downloaded from kaggle
EDA
- identify independent variables that influence the outcome
Feature Engineering
- normalize and standardize independent variables as necessary
- reduce dimensionality
Train/Test Split
- split data for training and final testing to see how performance will be in the real world
- use random shuffle and stratified split to preserve proportions of classes
Model Selection, Cross Validation, and Tuning
- use K-fold cross validation to reduce bias, build more generalized model and prevent overfitting
- apply hyperparameter tuning to search for best settings that provide improved bias and variance
Model Validation
- run model on test set to see how model will perform on real world data
Create GAN (TBD)
- create a Generative Adversarial Network (GAN) deep learning architecture
- train two neural networks to compete against each other to generate more authentic new data from a given training dataset
Create VAE (TBD)
- create a Variational Autoencoder (VAE) deep learning architecture
- train neural network to use in anomaly detection
Conclusion¶
In [ ]:
# install any necessary python packages
!pip install kagglehub
In [ ]:
!pip install tensorflow
In [ ]:
!pip install keras_tuner
Import Libraries¶
In [1]:
import datetime
import time
import numpy as np
import pandas as pd
# import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm
import pylab as plt
from IPython.display import Image
from IPython.core.display import HTML
from pylab import rcParams
import sklearn
from sklearn import decomposition
from sklearn.decomposition import PCA
from sklearn import datasets
import kagglehub
import ppscore as pps
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, KFold
from sklearn import metrics
from sklearn.metrics import confusion_matrix
import pickle
import tensorflow as tf
from tensorflow.keras import datasets, layers, models, losses
from matplotlib import pyplot as plt
import keras_tuner
import keras
2024-10-20 12:28:59.926457: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Download latest dataset version¶
In [2]:
pathstr = kagglehub.dataset_download("adarshde/wine-quality-dataset")
print("Path to dataset files:", pathstr)
df = pd.read_csv(pathstr+'/winequality-dataset_updated.csv')
df = df.drop_duplicates()
Path to dataset files: /Users/Mark/.cache/kagglehub/datasets/adarshde/wine-quality-dataset/versions/3
Exploratory Data Analysis (EDA)¶
In [3]:
df.head()
Out[3]:
| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.3 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.2 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1760 entries, 0 to 1998 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 fixed acidity 1760 non-null float64 1 volatile acidity 1760 non-null float64 2 citric acid 1760 non-null float64 3 residual sugar 1760 non-null float64 4 chlorides 1760 non-null float64 5 free sulfur dioxide 1760 non-null float64 6 total sulfur dioxide 1760 non-null float64 7 density 1760 non-null float64 8 pH 1760 non-null float64 9 sulphates 1760 non-null float64 10 alcohol 1760 non-null float64 11 quality 1760 non-null int64 dtypes: float64(11), int64(1) memory usage: 178.8 KB
In [5]:
df.describe().T.style.background_gradient(axis=0)
Out[5]:
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| fixed acidity | 1760.000000 | 8.710455 | 2.293976 | 4.600000 | 7.100000 | 8.000000 | 10.000000 | 15.900000 |
| volatile acidity | 1760.000000 | 0.545045 | 0.183404 | 0.120000 | 0.400000 | 0.540000 | 0.660000 | 1.580000 |
| citric acid | 1760.000000 | 0.244261 | 0.180000 | 0.000000 | 0.110000 | 0.190000 | 0.380000 | 1.000000 |
| residual sugar | 1760.000000 | 3.844392 | 3.424476 | 0.900000 | 2.000000 | 2.400000 | 3.800000 | 15.990000 |
| chlorides | 1760.000000 | 0.074782 | 0.050203 | 0.010000 | 0.050000 | 0.074000 | 0.086000 | 0.611000 |
| free sulfur dioxide | 1760.000000 | 20.788636 | 16.118756 | 1.000000 | 9.000000 | 16.000000 | 28.000000 | 72.000000 |
| total sulfur dioxide | 1760.000000 | 53.722443 | 37.795090 | 6.000000 | 24.000000 | 44.000000 | 75.000000 | 289.000000 |
| density | 1760.000000 | 0.996411 | 0.002118 | 0.990070 | 0.995200 | 0.996550 | 0.997800 | 1.003690 |
| pH | 1760.000000 | 3.286381 | 0.286839 | 2.340000 | 3.170000 | 3.300000 | 3.420000 | 4.160000 |
| sulphates | 1760.000000 | 0.989398 | 0.821606 | 0.330000 | 0.560000 | 0.660000 | 0.870000 | 3.990000 |
| alcohol | 1760.000000 | 10.711487 | 1.411144 | 8.400000 | 9.500000 | 10.400000 | 11.500000 | 15.000000 |
| quality | 1760.000000 | 5.627841 | 1.312301 | 2.000000 | 5.000000 | 6.000000 | 6.000000 | 9.000000 |
Attribute Information¶
| Feature | Explain |
|---|---|
| fixed acidity | most acids involved with wine or fixed or nonvolatile |
| volatile acidity | the amount of acetic acid in wine |
| citric acid | the amount of citric acid in wine |
| residual sugar | the amount of sugar remaining after fermentation stops |
| chlorides | the amount of salt in the wine |
| free sulfur dioxide | the amount of free sulfur dioxide in the wine(those available to react and thus exhibit both germicidal and antioxidant properties) |
| total sulfur dioxide | amount of free and bound forms of SO2 |
| density | the measurement of how tightly a material is packed together |
| PH | describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 |
| Alcohol | the percent alcohol content of the wine |
| quality | output variable (based on sensory data, score between 3 and 8) |
check for missing values¶
In [7]:
df.isna().sum()
Out[7]:
fixed acidity 0 volatile acidity 0 citric acid 0 residual sugar 0 chlorides 0 free sulfur dioxide 0 total sulfur dioxide 0 density 0 pH 0 sulphates 0 alcohol 0 quality 0 dtype: int64
Visualization - create histograms for each independent variable¶
In [8]:
for i in df.columns:
plt.figure(figsize=(6, 4))
sns.histplot(data=df[i])
plt.title(f'{i}')
plt.tight_layout()
plt.show()
Visualization - create box plots¶
In [9]:
columns = list(df.columns)
fig, ax = plt.subplots(11, 2, figsize=(15, 45))
plt.subplots_adjust(hspace = 0.5)
for i in range(11) :
# AX 1
sns.boxplot(x=columns[i], data=df, ax=ax[i, 0])
# Ax 2
sns.scatterplot(x=columns[i], y='quality', data=df, hue='quality', ax=ax[i, 1])
compare each dependent variable with quality using box plots¶
In [11]:
for i in df.columns:
if i != 'quality':
plt.figure(figsize=(6, 4)) # Set figure size for each plot
sns.boxplot(data=df, x='quality', y= i)
plt.title(f'Box plot for quality and {i}')
plt.tight_layout()
plt.show()
In [12]:
for i in df.columns:
if i != 'quality':
plt.figure(figsize=(6, 4))
sns.violinplot(data=df, x='quality', y=i)
plt.title(f'Violin plot for {i} by Quality')
plt.tight_layout()
plt.show()
Correlate each dependent variable with quality¶
In [13]:
%matplotlib inline
rcParams['figure.figsize'] = 12, 10
sns.set_style('whitegrid')
In [14]:
# Plotting the correlation heatmap
dataplot = sns.heatmap(df.corr(), cmap="YlGnBu", annot=True, annot_kws={"size": 12})
# Displaying heatmap
plt.show()
In [15]:
rcParams['figure.figsize'] = 15, 15
sns.pairplot(df, hue='quality', corner = True, palette='Blues')
Out[15]:
<seaborn.axisgrid.PairGrid at 0x177f31eb0>
In [16]:
# Plot the top N components
dfc = df.corr().iloc[:-1,-1:].sort_values(by='quality', ascending=True)
# dfc = df.corr().iloc[-1:,:-1].sort_values(by='quality', ascending=False).transpose()
# dfc = dfc.set_index('Source').rename_axis(None)
# dfc = df.corr().iloc[:-1,-1:].sort_values(by='quality', ascending=False).transpose()
# type(dfc)
# dfc.loc['quality'].plot(kind='bar', figsize=(10,4) )
dfc.plot.barh(figsize=(10,4) )
Out[16]:
<Axes: >
Prepare data for machine learning training¶
In [17]:
X = df.drop('quality', axis=1)
variable_names = X.columns
In [18]:
variable_names
Out[18]:
Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
'pH', 'sulphates', 'alcohol'],
dtype='object')
In [19]:
X.head()
Out[19]:
| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.3 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 |
| 4 | 7.2 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 |
In [20]:
pca = decomposition.PCA()
wine_pca = pca.fit_transform(X)
explained_variance = pca.explained_variance_ratio_
In [21]:
comps = pd.DataFrame(pca.components_, columns=variable_names)
comps
Out[21]:
| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.002627 | 0.000503 | -0.000344 | 0.028497 | -0.000205 | 0.244696 | 0.969158 | -0.000004 | -0.000523 | 0.005899 | 0.001288 |
| 1 | 0.016668 | 0.000230 | -0.002320 | 0.081072 | -0.000844 | 0.965255 | -0.246279 | -0.000019 | -0.001673 | 0.017862 | 0.021218 |
| 2 | 0.321359 | 0.000777 | 0.000026 | 0.928018 | -0.004029 | -0.089868 | -0.006305 | 0.000001 | -0.011790 | 0.109290 | 0.123671 |
| 3 | -0.943642 | 0.011013 | -0.033403 | 0.321132 | -0.000738 | -0.012454 | -0.003658 | -0.000299 | 0.027195 | -0.027042 | 0.059466 |
| 4 | 0.009065 | -0.011031 | 0.004401 | -0.146875 | -0.007204 | -0.010019 | 0.004919 | -0.000524 | 0.001574 | 0.099087 | 0.983975 |
| 5 | -0.061430 | 0.020873 | -0.042513 | -0.081280 | -0.005116 | -0.008396 | -0.001221 | -0.000298 | -0.012820 | 0.987354 | -0.110666 |
| 6 | 0.035398 | 0.118715 | -0.174634 | -0.000466 | -0.022120 | 0.000611 | 0.000135 | -0.000363 | 0.976547 | 0.004652 | -0.000471 |
| 7 | 0.023222 | 0.787008 | -0.580439 | -0.008426 | -0.028641 | -0.001187 | -0.000008 | -0.001059 | -0.200759 | -0.042043 | 0.014280 |
| 8 | -0.017816 | 0.602446 | 0.783734 | 0.004263 | 0.130421 | 0.001325 | -0.000519 | 0.003855 | 0.070416 | 0.022155 | 0.002678 |
| 9 | 0.004173 | -0.053869 | -0.124076 | 0.001785 | 0.990743 | 0.000187 | 0.000063 | 0.003171 | 0.006645 | 0.002232 | 0.007200 |
| 10 | 0.000203 | 0.001271 | 0.003326 | 0.000036 | 0.003688 | -0.000001 | -0.000002 | -0.999987 | 0.000145 | -0.000203 | -0.000483 |
In [23]:
rcParams['figure.figsize'] = 10, 10
sns.heatmap(comps, cmap='Blues', annot=True )
Out[23]:
<Axes: >
In [24]:
# Plot the top N components
maxcol = np.argmax(pca.components_, axis=1)
n_components = 5 # Number of top components to display
rcParams['figure.figsize'] = 10, 4
plt.bar(range(0, n_components ), explained_variance[:n_components])
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Top {} Principal Components'.format(n_components))
plt.xticks(np.arange(5), variable_names[maxcol[0:5]])
plt.show()
In [25]:
ppscore_list = [pps.score(df, colName, 'quality') for colName in variable_names]
df_pp_score = pd.DataFrame(ppscore_list).sort_values('ppscore', ascending=False)
# df_pp_score
In [26]:
df_pp_score
Out[26]:
| x | y | ppscore | case | is_valid_score | metric | baseline_score | model_score | model | |
|---|---|---|---|---|---|---|---|---|---|
| 10 | alcohol | quality | 0.004587 | regression | True | mean absolute error | 0.973295 | 0.968831 | DecisionTreeRegressor() |
| 0 | fixed acidity | quality | 0.000000 | regression | True | mean absolute error | 0.973295 | 1.000633 | DecisionTreeRegressor() |
| 1 | volatile acidity | quality | 0.000000 | regression | True | mean absolute error | 0.973295 | 1.000094 | DecisionTreeRegressor() |
| 2 | citric acid | quality | 0.000000 | regression | True | mean absolute error | 0.973295 | 0.975877 | DecisionTreeRegressor() |
| 3 | residual sugar | quality | 0.000000 | regression | True | mean absolute error | 0.973295 | 1.129688 | DecisionTreeRegressor() |
| 4 | chlorides | quality | 0.000000 | regression | True | mean absolute error | 0.973295 | 1.010453 | DecisionTreeRegressor() |
| 5 | free sulfur dioxide | quality | 0.000000 | regression | True | mean absolute error | 0.973295 | 1.018520 | DecisionTreeRegressor() |
| 6 | total sulfur dioxide | quality | 0.000000 | regression | True | mean absolute error | 0.973295 | 1.036547 | DecisionTreeRegressor() |
| 7 | density | quality | 0.000000 | regression | True | mean absolute error | 0.973295 | 1.173801 | DecisionTreeRegressor() |
| 8 | pH | quality | 0.000000 | regression | True | mean absolute error | 0.973295 | 1.065762 | DecisionTreeRegressor() |
| 9 | sulphates | quality | 0.000000 | regression | True | mean absolute error | 0.973295 | 1.053535 | DecisionTreeRegressor() |
In [27]:
ax = df_pp_score.plot.barh(x='x',y='ppscore')
ax.set_xscale('log')
normalize data¶
In [28]:
# Create X from DataFrame and y as Target
X_temp = df.drop(columns='quality')
y = df.quality
In [29]:
scaler = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_temp)
X = pd.DataFrame(scaler, columns=X_temp.columns)
X.describe().T.style.background_gradient(axis=0, cmap='Blues')
Out[29]:
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| fixed acidity | 1760.000000 | 0.363757 | 0.203007 | 0.000000 | 0.221239 | 0.300885 | 0.477876 | 1.000000 |
| volatile acidity | 1760.000000 | 0.291127 | 0.125619 | 0.000000 | 0.191781 | 0.287671 | 0.369863 | 1.000000 |
| citric acid | 1760.000000 | 0.244261 | 0.180000 | 0.000000 | 0.110000 | 0.190000 | 0.380000 | 1.000000 |
| residual sugar | 1760.000000 | 0.195122 | 0.226937 | 0.000000 | 0.072896 | 0.099404 | 0.192180 | 1.000000 |
| chlorides | 1760.000000 | 0.107791 | 0.083533 | 0.000000 | 0.066556 | 0.106489 | 0.126456 | 1.000000 |
| free sulfur dioxide | 1760.000000 | 0.278713 | 0.227025 | 0.000000 | 0.112676 | 0.211268 | 0.380282 | 1.000000 |
| total sulfur dioxide | 1760.000000 | 0.168631 | 0.133552 | 0.000000 | 0.063604 | 0.134276 | 0.243816 | 1.000000 |
| density | 1760.000000 | 0.465591 | 0.155540 | 0.000000 | 0.376652 | 0.475771 | 0.567548 | 1.000000 |
| pH | 1760.000000 | 0.519989 | 0.157604 | 0.000000 | 0.456044 | 0.527473 | 0.593407 | 1.000000 |
| sulphates | 1760.000000 | 0.180163 | 0.224483 | 0.000000 | 0.062842 | 0.090164 | 0.147541 | 1.000000 |
| alcohol | 1760.000000 | 0.350225 | 0.213810 | 0.000000 | 0.166667 | 0.303030 | 0.469697 | 1.000000 |
total count for each label¶
In [30]:
df.quality.value_counts()
Out[30]:
5 632 6 575 7 233 4 98 3 60 9 60 8 58 2 44 Name: quality, dtype: int64
In [31]:
# Convert labels to one-hot encoding
y = tf.keras.utils.to_categorical(y)
In [32]:
# Split Dataframe
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
In [33]:
y_train
Out[33]:
array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 1.],
[0., 0., 0., ..., 1., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
In [34]:
#------------------------------------------------------------------------------
# hyperparameter tuning function
#------------------------------------------------------------------------------
def build_model(hp):
print(hp)
inputs = keras.Input(shape=(11, 11, 9))
x = inputs
n_layers = hp.Int( "n_layers", 2, 24 )
nodeunits = hp.Int( 'units', 4, 32 )
dropout =hp.Float( "dropout",0,0.25)
learning_rate = hp.Float( "learning_rate", 0.00001, 10 )
# batch_size = hp.Float( "batch_size", 4, 64 )
optimizer = hp.Choice( "optimizer", ["adam", "adamax"] )
# model = keras.Sequential()
# model.add(keras.layers.Dense(
# hp.Choice('units', [8, 16, 32]),
# activation='relu'))
# model.add(keras.layers.Dense(1, activation='relu'))
# model.compile(loss='mse')
#--------------------------------------
# configure model
#--------------------------------------
# number of hidden layers
# number of neurons
# activation function (relu)
# output layer (sigmoid for binary classification; softmax for binary or multiclass)
# initialize ANN
ann = tf.keras.models.Sequential()
# add hidden layers
for i in range(n_layers):
ann.add(tf.keras.layers.Dense(nodeunits,activation="relu"))
if dropout > 0:
x = layers.Dropout(dropout)(x)
# create output layer (number of units = number of classes)
# ann.add(tf.keras.layers.Dense(units=1,activation="sigmoid"))
ann.add(tf.keras.layers.Dense(units=10,activation="softmax"))
# compile model
# optimizers:
# Adam (good), AdamW, Adadelta, Adagrad, Adamax (good), Nadam, Ftrl (bad), Lion (very noise loss), SGD (good but takes alot of epochs)
if optimizer == "adamax":
opt = tf.keras.optimizers.Adamax(learning_rate)
else:
opt = tf.keras.optimizers.Adam(learning_rate)
# loss function
# mse, binary_crossentropy, categorical_crossentropy
# metrics (accuracy)
ann.compile(optimizer=opt,loss="categorical_crossentropy",metrics=['accuracy'])
return ann
In [35]:
# train DNN model
# run start time
print("start time: "+str(datetime.datetime.now()))
starttime = time.time()
numtrials = 200
numepochs = 25
# fitout = ann.fit( X_train, Y_train, batch_size=batchsize, validation_data=(X_test,Y_test), epochs=numepochs )
hp = keras_tuner.HyperParameters()
# hp.values["model_type"] =
hp.Float(
"learning_rate",
min_value=0.0001,
max_value=0.1,
sampling="log" )
hp.Int(
"n_layers",
min_value=2,
max_value=4 )
hp.Int(
"units",
min_value=11,
max_value=11 )
hp.Float(
"dropout",
min_value=0.0,
max_value=0.05 )
hp.Int(
"batch_size",
min_value=4,
max_value=32 )
hp.Choice(
"optimizer",
["adam"] )
# hyperparameter tuning
dts = str(datetime.datetime.now().isoformat(timespec="seconds"))
dts = dts.replace(":","")
pathout = "./tuner_"+dts
print("output path: "+pathout)
# tuner = keras_tuner.RandomSearch(
tuner = keras_tuner.BayesianOptimization(
build_model,
objective='val_loss', # val_accuracy val_loss
max_trials=numtrials,
directory=pathout,
hyperparameters=hp )
tuner.search( X_train, y_train, epochs=numepochs, validation_data=(X_test,y_test))
tuner.search_space_summary()
tuner.results_summary()
print( datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") + " runtime: " + str(round(time.time()-starttime,3)) + " seconds" )
Trial 200 Complete [00h 00m 06s]
val_loss: 1.4633156061172485
Best val_loss So Far: 1.1767619848251343
Total elapsed time: 00h 17m 25s
Search space summary
Default search space size: 6
learning_rate (Float)
{'default': 0.0001, 'conditions': [], 'min_value': 0.0001, 'max_value': 0.1, 'step': None, 'sampling': 'log'}
n_layers (Int)
{'default': None, 'conditions': [], 'min_value': 2, 'max_value': 4, 'step': 1, 'sampling': 'linear'}
units (Int)
{'default': None, 'conditions': [], 'min_value': 11, 'max_value': 11, 'step': 1, 'sampling': 'linear'}
dropout (Float)
{'default': 0.0, 'conditions': [], 'min_value': 0.0, 'max_value': 0.05, 'step': None, 'sampling': 'linear'}
batch_size (Int)
{'default': None, 'conditions': [], 'min_value': 4, 'max_value': 32, 'step': 1, 'sampling': 'linear'}
optimizer (Choice)
{'default': 'adam', 'conditions': [], 'values': ['adam'], 'ordered': False}
Results summary
Results in ./tuner_2024-10-20T123331/untitled_project
Showing 10 best trials
Objective(name="val_loss", direction="min")
Trial 094 summary
Hyperparameters:
learning_rate: 0.01313017789171866
n_layers: 3
units: 11
dropout: 0.019680902684635102
batch_size: 18
optimizer: adam
Score: 1.1767619848251343
Trial 077 summary
Hyperparameters:
learning_rate: 0.037046280053398446
n_layers: 3
units: 11
dropout: 0.013766438665671033
batch_size: 16
optimizer: adam
Score: 1.178104281425476
Trial 064 summary
Hyperparameters:
learning_rate: 0.014235594870374568
n_layers: 3
units: 11
dropout: 0.014699279876672576
batch_size: 16
optimizer: adam
Score: 1.180485725402832
Trial 174 summary
Hyperparameters:
learning_rate: 0.03505231393488723
n_layers: 4
units: 11
dropout: 0.028969137670465796
batch_size: 19
optimizer: adam
Score: 1.180999517440796
Trial 048 summary
Hyperparameters:
learning_rate: 0.042793137162632534
n_layers: 3
units: 11
dropout: 0.0067480563266899985
batch_size: 20
optimizer: adam
Score: 1.1811329126358032
Trial 041 summary
Hyperparameters:
learning_rate: 0.03990196699511094
n_layers: 3
units: 11
dropout: 0.010040175924648448
batch_size: 21
optimizer: adam
Score: 1.181243896484375
Trial 005 summary
Hyperparameters:
learning_rate: 0.0251291873419375
n_layers: 2
units: 11
dropout: 0.007240227956324658
batch_size: 13
optimizer: adam
Score: 1.1818410158157349
Trial 083 summary
Hyperparameters:
learning_rate: 0.04050796566630129
n_layers: 3
units: 11
dropout: 0.016593980314083344
batch_size: 15
optimizer: adam
Score: 1.182966947555542
Trial 136 summary
Hyperparameters:
learning_rate: 0.06017700513596118
n_layers: 3
units: 11
dropout: 0.014104510639885154
batch_size: 28
optimizer: adam
Score: 1.1832951307296753
Trial 069 summary
Hyperparameters:
learning_rate: 0.019199806108826153
n_layers: 2
units: 11
dropout: 0.010091488143066733
batch_size: 17
optimizer: adam
Score: 1.1835540533065796
2024-10-20 12:50:56 runtime: 1045.087 seconds
In [36]:
# return the best hyperparameters
best_hp = tuner.get_best_hyperparameters()[0]
ann = tuner.hypermodel.build(best_hp)
<keras_tuner.src.engine.hyperparameters.hyperparameters.HyperParameters object at 0x1870c60c0>
In [37]:
# select the best model
best_model = tuner.get_best_models()[0]
best_model.summary()
<keras_tuner.src.engine.hyperparameters.hyperparameters.HyperParameters object at 0x1870c60c0>
/opt/anaconda3/lib/python3.12/site-packages/keras/src/saving/saving_lib.py:719: UserWarning: Skipping variable loading for optimizer 'adam', because it has 2 variables whereas the saved optimizer has 18 variables. saveable.load_own_variables(weights_store.get(inner_path))
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ dense (Dense) │ (None, 11) │ 132 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_1 (Dense) │ (None, 11) │ 132 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_2 (Dense) │ (None, 11) │ 132 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_3 (Dense) │ (None, 10) │ 120 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 516 (2.02 KB)
Trainable params: 516 (2.02 KB)
Non-trainable params: 0 (0.00 B)
In [38]:
numepochs = 30
# fitout = ann.fit( X_train, Y_train, batch_size=batchsize, validation_data=(X_test,Y_test), epochs=numepochs )
fitout = ann.fit( X_train, y_train, validation_data=(X_test,y_test), epochs=numepochs )
# save model
modelfilename = "ANN.keras"
ann.save(modelfilename)
# load model from file
# ann = models.load_model(modelfilename)
# print metrics
ann.summary()
Epoch 1/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.2670 - loss: 1.9892 - val_accuracy: 0.3659 - val_loss: 1.4808 Epoch 2/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.3236 - loss: 1.5679 - val_accuracy: 0.3636 - val_loss: 1.3920 Epoch 3/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3861 - loss: 1.4584 - val_accuracy: 0.4136 - val_loss: 1.3231 Epoch 4/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4487 - loss: 1.3159 - val_accuracy: 0.4614 - val_loss: 1.2638 Epoch 5/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4401 - loss: 1.3395 - val_accuracy: 0.4455 - val_loss: 1.3174 Epoch 6/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4725 - loss: 1.3255 - val_accuracy: 0.4705 - val_loss: 1.2288 Epoch 7/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4831 - loss: 1.3172 - val_accuracy: 0.4886 - val_loss: 1.2574 Epoch 8/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4457 - loss: 1.3168 - val_accuracy: 0.4795 - val_loss: 1.2465 Epoch 9/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4860 - loss: 1.2478 - val_accuracy: 0.4864 - val_loss: 1.2369 Epoch 10/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4792 - loss: 1.2282 - val_accuracy: 0.4841 - val_loss: 1.2606 Epoch 11/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4627 - loss: 1.2766 - val_accuracy: 0.4818 - val_loss: 1.2406 Epoch 12/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4986 - loss: 1.2991 - val_accuracy: 0.4932 - val_loss: 1.2345 Epoch 13/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4868 - loss: 1.2399 - val_accuracy: 0.4591 - val_loss: 1.2583 Epoch 14/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4662 - loss: 1.2786 - val_accuracy: 0.4750 - val_loss: 1.2593 Epoch 15/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4867 - loss: 1.2257 - val_accuracy: 0.4841 - val_loss: 1.2203 Epoch 16/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4908 - loss: 1.2727 - val_accuracy: 0.4455 - val_loss: 1.3006 Epoch 17/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4782 - loss: 1.2941 - val_accuracy: 0.4750 - val_loss: 1.2212 Epoch 18/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4706 - loss: 1.3481 - val_accuracy: 0.4909 - val_loss: 1.2204 Epoch 19/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4933 - loss: 1.2386 - val_accuracy: 0.4727 - val_loss: 1.2090 Epoch 20/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.5180 - loss: 1.1936 - val_accuracy: 0.4841 - val_loss: 1.2058 Epoch 21/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.5077 - loss: 1.2327 - val_accuracy: 0.4795 - val_loss: 1.2079 Epoch 22/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4789 - loss: 1.2774 - val_accuracy: 0.4773 - val_loss: 1.2189 Epoch 23/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4861 - loss: 1.2328 - val_accuracy: 0.4818 - val_loss: 1.2130 Epoch 24/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.5063 - loss: 1.2276 - val_accuracy: 0.4659 - val_loss: 1.2262 Epoch 25/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4865 - loss: 1.2454 - val_accuracy: 0.4886 - val_loss: 1.1995 Epoch 26/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.5258 - loss: 1.1798 - val_accuracy: 0.4886 - val_loss: 1.2104 Epoch 27/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4742 - loss: 1.2574 - val_accuracy: 0.4909 - val_loss: 1.2127 Epoch 28/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.4778 - loss: 1.2222 - val_accuracy: 0.4705 - val_loss: 1.2176 Epoch 29/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4811 - loss: 1.2224 - val_accuracy: 0.4591 - val_loss: 1.2598 Epoch 30/30 42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4828 - loss: 1.2104 - val_accuracy: 0.4955 - val_loss: 1.2198
Model: "sequential_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ dense_4 (Dense) │ (None, 11) │ 132 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_5 (Dense) │ (None, 11) │ 132 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_6 (Dense) │ (None, 11) │ 132 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_7 (Dense) │ (None, 10) │ 120 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 1,550 (6.06 KB)
Trainable params: 516 (2.02 KB)
Non-trainable params: 0 (0.00 B)
Optimizer params: 1,034 (4.04 KB)
In [39]:
# accuracy metrics
history = fitout.history
acc = history['accuracy']
loss = history['loss']
val_acc = history['val_accuracy']
val_loss = history['val_loss']
print("final train accuracy: "+str(acc[-1]))
print("final train loss : "+str(loss[-1]))
print("final val accuracy: "+str(val_acc[-1]))
print("final val loss : "+str(val_loss[-1]))
print( datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") + " runtime: " + str(round(time.time()-starttime,3)) + " seconds" )
final train accuracy: 0.4856060743331909 final train loss : 1.2303400039672852 final val accuracy: 0.4954545497894287 final val loss : 1.219757080078125 2024-10-20 12:51:56 runtime: 1105.25 seconds
In [40]:
epochs_range = range(numepochs)
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
plt.plot( epochs_range, acc, label='Training Accuracy' )
plt.plot( epochs_range, val_acc, label='Validation Accuracy' )
plt.legend( loc='lower right' )
plt.ylim(0,1)
plt.title('Training and Validation Accuracy', fontsize=15 )
plt.subplot(1,2,2)
plt.plot( epochs_range, loss, label='Training Loss' )
plt.plot( epochs_range, val_loss, label='Validation Loss' )
plt.legend( loc='upper right' )
# plt.ylim(0,1)
plt.title('Training and Validation Loss', fontsize=15 )
plt.show()
Run inference on test data to validate performance¶
In [41]:
# define a function to ploting Confusion matrix
def plot_confusion_matrix(y_test, y_prediction):
'''Plotting Confusion Matrix'''
cm = metrics.confusion_matrix(y_test, y_prediction)
ax = plt.subplot()
ax = sns.heatmap(cm, annot=True, fmt='', cmap="Blues")
ax.set_xlabel('Prediced labels', fontsize=18)
ax.set_ylabel('True labels', fontsize=18)
ax.set_title('Confusion Matrix', fontsize=25)
ax.xaxis.set_ticklabels(['Bad', 'Good', 'Middle'])
ax.yaxis.set_ticklabels(['Bad', 'Good', 'Middle'])
plt.show()
In [42]:
# define a function to ploting Classification report
def clfr_plot(y_test, y_pred) :
''' Plotting Classification report'''
cr = pd.DataFrame(metrics.classification_report(y_test, y_pred_rf, digits=3,
output_dict=True)).T
cr.drop(columns='support', inplace=True)
sns.heatmap(cr, cmap='Blues', annot=True, linecolor='white', linewidths=0.5).xaxis.tick_top()
In [43]:
def clf_plot(y_test, y_pred) :
'''
1) Ploting Confusion Matrix
2) Plotting Classification Report
'''
y_predmax = np.argmax(y_pred, axis=1)
y_testmax = np.argmax(y_test, axis=1)
# metrics.f1_score(y_test, y_pred, average='weighted', labels=np.unique(y_pred))
# metrics.f1_score(y_test, y_pred, average='weighted',zero_division=0)
cm = metrics.confusion_matrix(y_testmax, y_predmax)
cr = pd.DataFrame(metrics.classification_report(y_testmax, y_predmax, digits=3, output_dict=True)).T
cr.drop(columns='support', inplace=True)
fig, ax = plt.subplots(1, 2, figsize=(15, 5))
# Left AX : Confusion Matrix
ax[0] = sns.heatmap(cm, annot=True, fmt='', cmap="Blues", ax=ax[0])
ax[0].set_xlabel('Prediced labels', fontsize=18)
ax[0].set_ylabel('True labels', fontsize=18)
ax[0].set_title('Confusion Matrix', fontsize=25)
# ax[0].xaxis.set_ticklabels(['Bad', 'Good', 'Middle'])
# ax[0].yaxis.set_ticklabels(['Bad', 'Good', 'Middle'])
# Right AX : Classification Report
ax[1] = sns.heatmap(cr, cmap='Blues', annot=True, linecolor='white', linewidths=0.5, ax=ax[1])
ax[1].xaxis.tick_top()
ax[1].set_title('Classification Report', fontsize=25)
plt.show()
In [44]:
# test predict (inference)
y_pred = ann.predict(X_test)
# y_pred = (y_pred > 0.5)
y_predmax = np.argmax(y_pred, axis=1)
y_testmax = np.argmax(y_test, axis=1)
cm = confusion_matrix( y_testmax, y_predmax)
print(cm)
#ann_score = round(ann.score(X_test, y_test), 3)
#print('ANN score : ', ann_score)
clf_plot(y_test, y_pred)
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step [[ 0 1 0 1 0 6 0 5] [ 0 1 0 1 0 2 0 2] [ 0 2 0 9 5 5 0 1] [ 0 0 0 96 50 5 0 9] [ 0 0 0 43 96 21 0 1] [ 0 3 0 1 20 22 0 7] [ 0 1 0 0 2 5 0 6] [ 0 2 0 0 1 5 0 3]]
/opt/anaconda3/lib/python3.12/site-packages/sklearn/metrics/_classification.py:1509: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/opt/anaconda3/lib/python3.12/site-packages/sklearn/metrics/_classification.py:1509: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/opt/anaconda3/lib/python3.12/site-packages/sklearn/metrics/_classification.py:1509: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
In [ ]: